# standard imports
import pandas as pd
import numpy as np
# Do not change this option; This allows the CodeGrade auto grading to function correctly
pd.set_option('display.max_columns', None)
First, import the data from the survivor.xlsx file, calling the respective DataFrames the same as the sheet name but with lowercase and snake case. For example, the sheet called Castaway Details should be saved as a DataFrame called castaway_details. Make sure that the data files are in the same folder as your notebook.
Note: You may or may not need to install openpyxl for the code below to work. You can use: $ pip install openpyxl
#pip install openpyxl
# import data from Excel
# setup Filename and Object
fileName = "survivor.xlsx"
xls = pd.ExcelFile(fileName)
# import individual sheets /// next will use copies of each df
castaway_details1 = pd.read_excel(xls, 'Castaway Details')
castaways1 = pd.read_excel(xls, 'Castaways')
challenge_description1 = pd.read_excel(xls, 'Challenge Description')
challenge_results1 = pd.read_excel(xls, 'Challenge Results')
confessionals1 = pd.read_excel(xls, 'Confessionals')
hidden_idols1 = pd.read_excel(xls, 'Hidden Idols')
jury_votes1 = pd.read_excel(xls, 'Jury Votes')
tribe_mapping1 = pd.read_excel(xls, 'Tribe Mapping')
viewers1 = pd.read_excel(xls, 'Viewers')
vote_history1 = pd.read_excel(xls, 'Vote History')
season_summary1 = pd.read_excel(xls, 'Season Summary')
season_palettes1 = pd.read_excel(xls, 'Season Palettes')
tribe_colours1 = pd.read_excel(xls, 'Tribe Colours')
#Copies of the original dataframes
castaway_details = castaway_details1.copy()
castaways = castaways1.copy()
challenge_description = challenge_description1.copy()
challenge_results = challenge_results1.copy()
confessionals = confessionals1.copy()
hidden_idols = hidden_idols1.copy()
jury_votes = jury_votes1.copy()
tribe_mapping = tribe_mapping1.copy()
viewers = viewers1.copy()
vote_history = vote_history1.copy()
season_summary = season_summary1.copy()
season_palettes = season_palettes1.copy()
tribe_colours = tribe_colours1.copy()
## Checking for the shape of each dataframes
print('castaway_details', castaway_details.shape)
print('castaways', castaways.shape)
print('challenge_description', challenge_description.shape)
print('challenge_results', challenge_results.shape)
print('confessionals', confessionals.shape)
print('hidden_idols', hidden_idols.shape)
print('jury_votes',jury_votes.shape)
print('tribe_mapping', tribe_mapping.shape)
print('viewers', viewers.shape)
print('vote_history', vote_history.shape)
print('season_summary', season_summary.shape)
print('season_palettes', season_palettes.shape)
print('tribe_colours', tribe_colours.shape)
castaway_details (608, 10) castaways (762, 20) challenge_description (892, 14) challenge_results (4441, 13) confessionals (6684, 6) hidden_idols (159, 10) jury_votes (933, 7) tribe_mapping (7322, 8) viewers (610, 9) vote_history (4751, 15) season_summary (41, 20) season_palettes (205, 3) tribe_colours (145, 5)
Exercise1: Change every column name of every DataFrame to lowercase and snake case. This is a standard first step for some programmers as lowercase makes it easier to write and snake case makes it easier to copy multiple-word column names.
For example, Castaway Id should end up being castaway_id. You should try doing this using a for loop instead of manually changing the names for each column. It should take you no more than a few lines of code. Use stackoverflow if you need help.
for columns in castaway_details.columns:
castaway_details.columns = castaway_details.columns.str.lower()
castaway_details.columns = castaway_details.columns.str.replace(' ','_')
castaway_details.head()
| castaway_id | full_name | short_name | date_of_birth | date_of_death | gender | race | ethnicity | occupation | personality_type | |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | Sonja Christopher | Sonja | 1937-01-28 | NaT | Female | NaN | NaN | Musician | ENFP |
| 1 | 2 | B.B. Anderson | B.B. | 1936-01-18 | 2013-10-29 | Male | NaN | NaN | Real Estate Developer | ESTJ |
| 2 | 3 | Stacey Stillman | Stacey | 1972-08-11 | NaT | Female | NaN | NaN | Attorney | ENTJ |
| 3 | 4 | Ramona Gray | Ramona | 1971-01-20 | NaT | Female | Black | NaN | Biochemist/Chemist | ISTJ |
| 4 | 5 | Dirk Been | Dirk | 1976-06-15 | NaT | Male | NaN | NaN | Dairy Farmer | ISFP |
for columns in castaways.columns:
castaways.columns = castaways.columns.str.lower()
castaways.columns = castaways.columns.str.replace(' ','_')
castaways.head()
| season_name | season | full_name | castaway_id | castaway | age | city | state | personality_type | episode | day | order | result | jury_status | original_tribe | swapped_tribe | swapped_tribe_2 | merged_tribe | total_votes_received | immunity_idols_won | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Survivor: 41 | 41 | Erika Casupanan | 594 | Erika | 32 | Toronta | Ontario | ENFP | 13 | 26 | 18 | Sole Survivor | NaN | Luvu | NaN | NaN | Via Kana | 2 | 8 |
| 1 | Survivor: 41 | 41 | Deshawn Radden | 601 | Deshawn | 26 | Miami | Florida | ENTP | 13 | 26 | 17 | Runner-up | NaN | Luvu | NaN | NaN | Via Kana | 7 | 6 |
| 2 | Survivor: 41 | 41 | Xander Hastings | 597 | Xander | 20 | Chicago | Illinois | INFJ | 13 | 26 | 16 | 2nd runner-up | NaN | Yase | NaN | NaN | Via Kana | 2 | 6 |
| 3 | Survivor: 41 | 41 | Heather Aldret | 593 | Heather | 52 | Charleston | South Carolina | ISFJ | 13 | 25 | 15 | 15th voted out | 8th jury member | Luvu | NaN | NaN | Via Kana | 4 | 6 |
| 4 | Survivor: 41 | 41 | Ricard Foye | 596 | Ricard | 31 | Sedro-Woolley | Washington | ENTJ | 13 | 24 | 14 | 14th voted out | 7th jury member | Ua | NaN | NaN | Via Kana | 9 | 5 |
for columns in challenge_description.columns:
challenge_description.columns = challenge_description.columns.str.lower()
challenge_description.columns = challenge_description.columns.str.replace(' ','_')
challenge_description.head()
| challenge_id | challenge_name | puzzle | race | precision | endurance | strength | turn_based | balance | food | knowledge | memory | fire | water | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | CH0001 | Quest for Fire | False | True | False | False | False | False | False | False | False | False | True | True |
| 1 | CH0002 | Bridging the Gap | False | True | False | False | False | False | False | False | False | False | True | True |
| 2 | CH0003 | Trail Blazer | False | True | False | False | False | False | False | False | False | False | True | False |
| 3 | CH0004 | Buggin' Out | False | False | False | False | False | False | False | True | False | False | False | False |
| 4 | CH0005 | Tucker'd Out | False | True | True | False | False | False | False | True | False | False | False | False |
for columns in challenge_results.columns:
challenge_results.columns = challenge_results.columns.str.lower()
challenge_results.columns = challenge_results.columns.str.replace(' ','_')
challenge_results.head()
| season_name | season | episode | day | episode_title | challenge_name | challenge_type | outcome_type | challenge_id | winner_id | winner | winning_tribe | outcome_status | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Survivor: Borneo | 1 | 1 | 3 | The Marooning | Quest for Fire | Reward and Immunity | Tribal | CH0001 | 2.0 | B.B. | Pagong | Winner |
| 1 | Survivor: Borneo | 1 | 1 | 3 | The Marooning | Quest for Fire | Reward and Immunity | Tribal | CH0001 | 4.0 | Ramona | Pagong | Winner |
| 2 | Survivor: Borneo | 1 | 1 | 3 | The Marooning | Quest for Fire | Reward and Immunity | Tribal | CH0001 | 6.0 | Joel | Pagong | Winner |
| 3 | Survivor: Borneo | 1 | 1 | 3 | The Marooning | Quest for Fire | Reward and Immunity | Tribal | CH0001 | 7.0 | Gretchen | Pagong | Winner |
| 4 | Survivor: Borneo | 1 | 1 | 3 | The Marooning | Quest for Fire | Reward and Immunity | Tribal | CH0001 | 8.0 | Greg | Pagong | Winner |
for columns in confessionals.columns:
confessionals.columns = confessionals.columns.str.lower()
confessionals.columns = confessionals.columns.str.replace(' ','_')
confessionals.head()
| season_name | season | episode | castaway | castaway_id | confessional_count | |
|---|---|---|---|---|---|---|
| 0 | Survivor: 41 | 41 | 1 | JD | 603 | 11 |
| 1 | Survivor: 41 | 41 | 1 | Evvie | 598 | 9 |
| 2 | Survivor: 41 | 41 | 1 | Danny | 599 | 8 |
| 3 | Survivor: 41 | 41 | 1 | Xander | 597 | 5 |
| 4 | Survivor: 41 | 41 | 1 | Deshawn | 601 | 4 |
for columns in hidden_idols.columns:
hidden_idols.columns = hidden_idols.columns.str.lower()
hidden_idols.columns = hidden_idols.columns.str.replace(' ','_')
hidden_idols.head()
| season_name | season | castaway_id | castaway | idol_number | idols_held | votes_nullified | day_found | day_played | legacy_advantage | |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Survivor: Guatemala | 11 | 161 | Gary | 1 | 1 | 0.0 | 24.0 | NaN | False |
| 1 | Survivor: Panama | 12 | 180 | Terry | 1 | 1 | 0.0 | NaN | NaN | False |
| 2 | Survivor: Cook Islands | 13 | 202 | Yul | 1 | 1 | 0.0 | NaN | NaN | False |
| 3 | Survivor: Fiji | 14 | 218 | Yau-Man | 1 | 1 | 4.0 | 17.0 | 36.0 | False |
| 4 | Survivor: Fiji | 14 | 214 | Mookie | 1 | 1 | 0.0 | 20.0 | NaN | False |
for columns in jury_votes.columns:
jury_votes.columns = jury_votes.columns.str.lower()
jury_votes.columns = jury_votes.columns.str.replace(' ','_')
jury_votes.head()
| season_name | season | castaway | finalist | vote | castaway_id | finalist_id | |
|---|---|---|---|---|---|---|---|
| 0 | Survivor: 41 | 41 | Heather | Deshawn | 0 | 593 | 601 |
| 1 | Survivor: 41 | 41 | Ricard | Deshawn | 0 | 596 | 601 |
| 2 | Survivor: 41 | 41 | Danny | Deshawn | 1 | 599 | 601 |
| 3 | Survivor: 41 | 41 | Liana | Deshawn | 0 | 608 | 601 |
| 4 | Survivor: 41 | 41 | Shan | Deshawn | 0 | 606 | 601 |
for columns in tribe_mapping.columns:
tribe_mapping.columns = tribe_mapping.columns.str.lower()
tribe_mapping.columns = tribe_mapping.columns.str.replace(' ','_')
tribe_mapping.head()
| season_name | season | episode | day | castaway_id | castaway | tribe | tribe_status | |
|---|---|---|---|---|---|---|---|---|
| 0 | Survivor: Borneo | 1 | 1 | 3 | 2.0 | B.B. | Pagong | Original |
| 1 | Survivor: Borneo | 1 | 1 | 3 | 4.0 | Ramona | Pagong | Original |
| 2 | Survivor: Borneo | 1 | 1 | 3 | 6.0 | Joel | Pagong | Original |
| 3 | Survivor: Borneo | 1 | 1 | 3 | 7.0 | Gretchen | Pagong | Original |
| 4 | Survivor: Borneo | 1 | 1 | 3 | 8.0 | Greg | Pagong | Original |
for columns in viewers.columns:
viewers.columns = viewers.columns.str.lower()
viewers.columns = viewers.columns.str.replace(' ','_')
viewers.head()
| season_name | season | episode_number_overall | episode | episode_title | episode_date | viewers | rating_18_49 | share_18_49 | |
|---|---|---|---|---|---|---|---|---|---|
| 0 | Survivor: 41 | 41 | 597.0 | 1 | A New Era | 2021-09-22 | 6.25 | 1.1 | 8.0 |
| 1 | Survivor: 41 | 41 | 598.0 | 2 | Juggling Chainsaws | 2021-09-29 | 5.90 | 1.0 | 7.0 |
| 2 | Survivor: 41 | 41 | 599.0 | 3 | My Million Dollar Mistake | 2021-10-06 | 5.79 | 0.9 | 6.0 |
| 3 | Survivor: 41 | 41 | 600.0 | 4 | They Hate Me Because They Ain't Me | 2021-10-13 | 5.68 | 0.9 | 7.0 |
| 4 | Survivor: 41 | 41 | 601.0 | 5 | The Strategist or the Loyalist | 2021-10-20 | 5.62 | 1.0 | 7.0 |
for columns in vote_history.columns:
vote_history.columns = vote_history.columns.str.lower()
vote_history.columns = vote_history.columns.str.replace(' ','_')
vote_history.head()
| season_name | season | episode | day | tribe_status | castaway | immunity | vote | nullified | voted_out | order | vote_order | castaway_id | vote_id | voted_out_id | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Survivor: 41 | 41 | 1 | 3 | Original | Evvie | NaN | Abraham | False | Abraham | 1 | 1 | 598 | 591.0 | 591.0 |
| 1 | Survivor: 41 | 41 | 1 | 3 | Original | Liana | NaN | Abraham | False | Abraham | 1 | 1 | 608 | 591.0 | 591.0 |
| 2 | Survivor: 41 | 41 | 1 | 3 | Original | Tiffany | NaN | Abraham | False | Abraham | 1 | 1 | 604 | 591.0 | 591.0 |
| 3 | Survivor: 41 | 41 | 1 | 3 | Original | Voce | NaN | Abraham | False | Abraham | 1 | 1 | 607 | 591.0 | 591.0 |
| 4 | Survivor: 41 | 41 | 1 | 3 | Original | Xander | NaN | Abraham | False | Abraham | 1 | 1 | 597 | 591.0 | 591.0 |
for columns in season_summary.columns:
season_summary.columns = season_summary.columns.str.lower()
season_summary.columns = season_summary.columns.str.replace(' ','_')
season_summary.head()
| season_name | season | location | country | tribe_setup | full_name | winner_id | winner | runner_ups | final_vote | timeslot | premiered | ended | filming_started | filming_ended | viewers_premier | viewers_finale | viewers_reunion | viewers_mean | rank | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Survivor: Borneo | 1 | Pulau Tiga, Sabah, Malaysia | Malaysia | Two tribes of eight new players | Richard Hatch | 16 | Richard | Kelly Wiglesworth | 4-3 | Wednesday 8:00 pm | 2000-05-31 | 2000-08-23 | 2000-03-13 | 2000-04-20 | 15.51 | 51.69 | 36.70 | 28.30 | 2.0 |
| 1 | Survivor: The Australian Outback | 2 | Herbert River at Goshen Station, Queensland, A... | Australia | Two tribes of eight new players | Tina Wesson | 32 | Tina | Colby Donaldson | 4-3 | Thursday 8:00 pm | 2001-01-28 | 2001-05-03 | 2000-10-23 | 2000-12-03 | 45.37 | 36.35 | 28.01 | 29.80 | 1.0 |
| 2 | Survivor: Africa | 3 | Shaba National Reserve, Kenya | Kenya | Two tribes of eight new players | Ethan Zohn | 48 | Ethan | Kim Johnson | 5-2 | Thursday 8:00 pm | 2001-10-11 | 2002-01-10 | 2001-07-11 | 2001-08-18 | 23.84 | 27.26 | 19.05 | 20.69 | 8.0 |
| 3 | Survivor: Marquesas | 4 | Nuku Hiva, Marquesas Islands, French Polynesia | Polynesia | Two tribes of eight new players | Vecepia Towery | 64 | Vecepia | Neleh Dennis | 4-3 | Thursday 8:00 pm | 2002-02-28 | 2002-05-19 | 2001-11-12 | 2001-12-20 | 23.19 | 25.87 | 19.05 | 20.77 | 6.0 |
| 4 | Survivor: Thailand | 5 | Ko Tarutao, Satun Province, Thailand | Thailand | Two tribes of eight new players; picked by the... | Brian Heidik | 80 | Brian | Clay Jordan | 4-3 | Thursday 8:00 pm | 2002-09-19 | 2002-12-19 | 2002-06-10 | 2002-07-18 | 23.05 | 24.08 | 20.43 | 21.21 | 4.0 |
for columns in season_palettes.columns:
season_palettes.columns = season_palettes.columns.str.lower()
season_palettes.columns = season_palettes.columns.str.replace(' ','_')
season_palettes.head()
| season_name | season | palette | |
|---|---|---|---|
| 0 | Survivor: 41 | 41 | #ABC9E4 |
| 1 | Survivor: 41 | 41 | #6990A7 |
| 2 | Survivor: 41 | 41 | #1D4164 |
| 3 | Survivor: 41 | 41 | #F3A92B |
| 4 | Survivor: 41 | 41 | #E8E086 |
for columns in tribe_colours.columns:
tribe_colours.columns = tribe_colours.columns.str.lower()
tribe_colours.columns = tribe_colours.columns.str.replace(' ','_')
tribe_colours.head()
| season_name | season | tribe | tribe_colour | tribe_status | |
|---|---|---|---|---|---|
| 0 | Survivor: Borneo | 1 | Pagong | #FFFF05 | original |
| 1 | Survivor: Borneo | 1 | Rattana | #7CFC00 | merged |
| 2 | Survivor: Borneo | 1 | Tagi | #FF9900 | original |
| 3 | Survivor: The Australian Outback | 2 | Barramundi | #FF6600 | merged |
| 4 | Survivor: The Australian Outback | 2 | Kucha | #32CCFF | original |
Q2: Who is the oldest contestant that is still alive? We want to look at their current age and NOT the age they were when they played their season. Select their row from the castaway_details DataFrame and save this as Q2. This should return a DataFrame and the index and missing values should be left as is.
Important: Remember to show your work (i.e. how you found this answer by using Python/Pandas/Numpy). See note in the instructions above. This is true for all of the following questions as well.
### ENTER CODE HERE ###
Q2 = castaway_details[pd.isnull(castaway_details['date_of_death'])]
Q2 = Q2[Q2['date_of_birth'] == Q2['date_of_birth'].min()]
Q2
| castaway_id | full_name | short_name | date_of_birth | date_of_death | gender | race | ethnicity | occupation | personality_type | |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | Sonja Christopher | Sonja | 1937-01-28 | NaT | Female | NaN | NaN | Musician | ENFP |
Q3: What contestant was the oldest at the time of their season? We want to look at their age at the time of the season and NOT their current age. Select their row from the castaway_details DataFrame and save this as Q3. This should return a DataFrame and the index and missing values should be left as is.
x = castaways.groupby(['season','castaway_id'])['age'].max()
y = x.loc[x.groupby(level=0).idxmax()]
y = y.to_frame().reset_index()
y = y['castaway_id'].to_list()
Q3 = castaway_details[castaway_details['castaway_id'].isin(y)]
print(Q3)
castaway_id full_name short_name date_of_birth date_of_death \
13 14 Rudy Boesch Rudy 1928-01-20 2019-11-01
27 28 Rodger Bingham Rodger 1947-07-05 NaT
31 32 Tina Wesson Tina 1960-12-26 NaT
46 47 Kim Johnson Kim J. 1944-09-18 NaT
60 61 Paschal English Paschal 1945-03-05 NaT
74 75 Jake Billingsley Jake 1941-08-21 NaT
86 87 Roger Sexton Roger 1946-09-26 NaT
98 99 Lillian Morris Lillian 1952-04-03 NaT
127 128 Scout Cloud Lee Scout 1944-11-08 NaT
136 137 Willard Smith Willard 1947-11-30 NaT
150 151 Jim Lynch Jim 1942-01-07 NaT
175 176 Bruce Kanegai Bruce 1947-12-17 NaT
179 180 Terry Deitz Terry 1959-10-10 NaT
182 183 Sekou Bunch Sekou 1960-09-09 NaT
205 206 Gary Stritesky Gary 1951-09-16 NaT
217 218 Yau-Man Chan Yau-Man 1952-08-26 NaT
221 222 Steve Morris Chicken 1959-09-02 NaT
248 249 Gillian Larson Gillian 1947-02-24 NaT
257 258 Randy Bailey Randy 1959-03-31 NaT
268 269 Sandy Burgin Sandy 1955-07-01 NaT
282 283 Mike Borassi Mike 1947-03-13 NaT
303 304 Jimmy Johnson Jimmy J. 1943-07-16 NaT
336 337 Phillip Sheppard Phillip 1958-03-12 NaT
350 351 Rick Nelson Rick 1959-10-24 NaT
363 364 Troy Robertson Troyzan 1961-07-30 NaT
365 366 Greg Smith Tarzan 1947-12-12 NaT
379 380 Artis Silvester Artis 1959-04-18 NaT
385 386 Denise Stapley Denise 1971-01-01 NaT
419 420 Trish Hegarty Trish 1965-10-09 NaT
429 430 Dale Wentworth Dale 1959-03-07 NaT
458 459 Carolyn Rivera Carolyn 1961-10-07 NaT
473 474 Joseph Del Campo Joe 1943-07-04 NaT
480 481 Paul Wachter Paul 1963-10-10 NaT
498 499 Katrina Radke Katrina 1970-12-17 NaT
532 533 Angela Perkins Angela 1974-11-02 NaT
541 542 Natalie Cole Natalie 1961-02-26 NaT
556 557 Reem Daly Reem 1971-07-08 NaT
574 575 Tom Laidlaw Tom 1958-04-15 NaT
592 593 Heather Aldret Heather 1969-03-01 NaT
gender race ethnicity occupation \
13 Male NaN NaN Retired Navy SEAL
27 Male NaN NaN Teacher/Farmer
31 Female NaN NaN Personal Nurse;Motivational Speaker
46 Female NaN NaN Retired Teacher
60 Male NaN NaN Judge
74 Male NaN NaN Land Broker
86 Male NaN NaN Construction Company V.P.
98 Female NaN NaN Scout Troop Leader
127 Female NaN NaN Rancher
136 Male NaN NaN Lawyer
150 Male NaN NaN Retired Fire Captain
175 Male Asian Japanese American Art Teacher
179 Male NaN NaN Pilot
182 Male Black NaN Jazz Musician
205 Male NaN NaN School Bus Driver
217 Male Asian Bornean American Computer Engineer
221 Male NaN NaN Chicken Farmer
248 Female NaN NaN Retired Nurse
257 Male NaN NaN Wedding Videographer
268 Female NaN NaN Bus Driver
282 Male NaN NaN Personal Chef
303 Male NaN NaN Former NFL Coach
336 Male Black NaN Former Federal Agent;Software Sales
350 Male NaN NaN Rancher
363 Male NaN NaN Swimsuit Photographer
365 Male NaN NaN Plastic Surgeon
379 Male Black NaN Computer Engineer
385 Female NaN NaN Sex Therapist
419 Male NaN NaN Pilates Trainer
429 Male NaN NaN Farmer
458 Female NaN NaN Corporate Executive
473 Male NaN NaN Former FBI Agent
480 Male NaN NaN Boat Mechanic
498 Female NaN NaN Olympian
532 Female NaN NaN Army Veteran
541 Female Black NaN Publishing CEO
556 Female NaN NaN Sales Representative
574 Male NaN NaN Former NHL Player
592 Female NaN NaN Stay-at-home Mom
personality_type
13 ISTJ
27 ISFJ
31 ESFJ
46 ISFJ
60 ISFJ
74 ISFJ
86 ISTJ
98 INFP
127 INFJ
136 INTP
150 ISTJ
175 INFP
179 ESTJ
182 ESFJ
205 ISFJ
217 INTP
221 ISFP
248 ENFP
257 INTJ
268 ESFP
282 ESTP
303 ESFJ
336 ESTJ
350 ISFJ
363 ESFP
365 INTP
379 ISTJ
385 INFJ
419 ENFJ
429 ISTJ
458 ENTP
473 ISTJ
480 ESTJ
498 ENFP
532 ISTJ
541 ESTJ
556 ESTP
574 ESFJ
592 ISFJ
Q4: What contestant played in the most number of seasons? Select their row from the castaway_details DataFrame and save this as Q4. This should return a DataFrame and the index and missing values should be left as is.
combined1 = castaway_details.merge(castaways, how='right', left_on= 'castaway_id', right_on='castaway_id')
combined3 = combined1.groupby(['castaway_id','full_name_x'],as_index=False)[['season']].nunique().sort_values(by = 'season', ascending = False)
combined3 = combined3.reset_index(drop = True)
combined3
most_played = combined3['castaway_id'].head(1).to_list()
Q4 = castaway_details[castaway_details['castaway_id'].isin(most_played)]
print(Q4)
castaway_id full_name short_name date_of_birth date_of_death gender \ 54 55 Rob Mariano Boston Rob 1975-12-25 NaT Male race ethnicity occupation personality_type 54 NaN NaN Construction Worker ESTJ
Q5: Create a DataFrame of all the contestants that won their season (aka their final result in the castaways DataFrame was the 'Sole Survivor'). Call this DataFrame sole_survivor. Note that contestants may appear more than one time in this DataFrame if they won more than one season. Make sure that the index goes from 0 to n-1 and that the DataFrame is sorted ascending by season number.
The DataFrame should have the same columns, and the columns should be in the same order, as the castaways DataFrame.
sole_survivor = castaways[castaways['result'] == 'Sole Survivor'].sort_values(by = 'season')
print(sole_survivor)
season_name season full_name \
746 Survivor: Borneo 1 Richard Hatch
730 Survivor: The Australian Outback 2 Tina Wesson
714 Survivor: Africa 3 Ethan Zohn
698 Survivor: Marquesas 4 Vecepia Towery
682 Survivor: Thailand 5 Brian Heidik
666 Survivor: The Amazon 6 Jenna Morasca
648 Survivor: Pearl Islands 7 Sandra Diaz-Twine
630 Survivor: All-Stars 8 Amber Brkich
612 Survivor: Vanuatu 9 Chris Daugherty
592 Survivor: Palau 10 Tom Westman
574 Survivor: Guatemala 11 Danni Boatwright
558 Survivor: Panama 12 Aras Baskauskas
538 Survivor: Cook Islands 13 Yul Kwon
519 Survivor: Fiji 14 Earl Cole
503 Survivor: China 15 Todd Herzog
483 Survivor: Micronesia 16 Parvati Shallow
465 Survivor: Gabon 17 Robert Crowley
449 Survivor: Tocantins 18 James Thomas Jr.
429 Survivor: Samoa 19 Natalie White
409 Survivor: Heroes vs. Villains 20 Sandra Diaz-Twine
389 Survivor: Nicaragua 21 Jud Birza
369 Survivor: Redemption Island 22 Rob Mariano
349 Survivor: South Pacific 23 Sophie Clarke
331 Survivor: One World 24 Kim Spradlin
313 Survivor: Philippines 25 Denise Stapley
293 Survivor: Caramoan 26 John Cochran
270 Survivor: Blood vs. Water 27 Tyson Apostol
252 Survivor: Cagayan 28 Tony Vlachos
234 Survivor: San Juan del Sur 29 Natalie Anderson
216 Survivor: Worlds Apart 30 Mike Holloway
196 Survivor: Cambodia 31 Jeremy Collins
178 Survivor: Kaoh Rong 32 Michele Fitzgerald
158 Survivor: Millennials vs. Gen X 33 Adam Klein
138 Survivor: Game Changers 34 Sarah Lacina
120 Survivor: Heroes vs. Healers vs. Hustlers 35 Ben Driebergen
100 Survivor: Ghost Island 36 Wendell Holland
80 Survivor: David vs. Goliath 37 Nick Wilson
60 Survivor: Edge of Extinction 38 Chris Underwood
40 Survivor: Island of the Idols 39 Tommy Sheehan
18 Survivor: Winners at War 40 Tony Vlachos
0 Survivor: 41 41 Erika Casupanan
castaway_id castaway age city state \
746 16 Richard 39 Newport Rhode Island
730 32 Tina 40 Knoxville Tennessee
714 48 Ethan 27 Lexington Massachusetts
698 64 Vecepia 36 Hayward California
682 80 Brian 34 Quartz Hill California
666 96 Jenna 21 Bridgeville Pennsylvania
648 112 Sandra 29 Fort Lewis Washington
630 27 Amber 25 Beaver Pennsylvania
612 130 Chris 33 South Vienna Ohio
592 150 Tom 40 Sayville New York
574 166 Danni 29 Tonganoxie Kansas
558 182 Aras 24 Santa Monica California
538 202 Yul 31 San Mateo California
519 221 Earl 35 Santa Monica California
503 237 Todd 22 Pleasant Grove Utah
483 197 Parvati 25 Los Angeles California
465 265 Bob 57 South Portland Maine
449 281 J.T. 24 Mobile Alabama
429 301 Natalie 26 Van Buren Arkansas
409 112 Sandra 35 Fayetteville North Carolina
389 321 Fabio 21 Venice California
369 55 Boston Rob 34 Pensacola Florida
349 353 Sophie 22 Willsboro New York
331 371 Kim 29 San Antonio Texas
313 386 Denise 41 Cedar Rapids Iowa
293 348 Cochran 25 Washington D.C.
270 274 Tyson 34 Provo Utah
252 424 Tony 39 Jersey City New Jersey
234 442 Natalie 28 Edgewater New Jersey
216 460 Mike 38 North Richland Hills Texas
196 433 Jeremy 37 Foxboro Massachusetts
178 478 Michele 24 Freehold New Jersey
158 498 Adam 25 San Francisco California
138 414 Sarah 32 Marion Iowa
120 516 Ben 34 Boise Idaho
100 536 Wendell 33 Philadelphia Pennsylvania
80 556 Nick 27 Williamsburg Kentucky
60 559 Chris 25 Greenville South Carolina
40 590 Tommy 26 Long Beach New York
18 424 Tony 45 Allendale New Jersey
0 594 Erika 32 Toronta Ontario
personality_type episode day order result jury_status \
746 ENTP 14 39 16 Sole Survivor NaN
730 ESFJ 16 42 16 Sole Survivor NaN
714 ISFP 15 39 16 Sole Survivor NaN
698 ISTJ 15 39 16 Sole Survivor NaN
682 ISTP 15 39 16 Sole Survivor NaN
666 ISTP 15 39 16 Sole Survivor NaN
648 ESTP 15 39 18 Sole Survivor NaN
630 ISFP 17 39 18 Sole Survivor NaN
612 ENTP 15 39 18 Sole Survivor NaN
592 ESTJ 15 39 20 Sole Survivor NaN
574 ENFJ 15 39 18 Sole Survivor NaN
558 INFP 16 39 16 Sole Survivor NaN
538 INTJ 16 39 20 Sole Survivor NaN
519 INFJ 15 39 19 Sole Survivor NaN
503 ENFP 15 39 16 Sole Survivor NaN
483 ENFJ 15 39 20 Sole Survivor NaN
465 INTP 14 39 18 Sole Survivor NaN
449 ESTP 15 39 16 Sole Survivor NaN
429 ISFJ 16 39 20 Sole Survivor NaN
409 ESTP 15 39 20 Sole Survivor NaN
389 ESFP 16 39 20 Sole Survivor NaN
369 ESTJ 15 39 20 Sole Survivor NaN
349 INTJ 16 39 20 Sole Survivor NaN
331 INFJ 15 39 18 Sole Survivor NaN
313 INFJ 15 39 18 Sole Survivor NaN
293 INTP 15 39 20 Sole Survivor NaN
270 ESTP 15 39 23 Sole Survivor NaN
252 ESTP 14 39 18 Sole Survivor NaN
234 ESTP 15 39 18 Sole Survivor NaN
216 ESTP 15 39 18 Sole Survivor NaN
196 ESTJ 15 39 20 Sole Survivor NaN
178 ESFP 15 39 18 Sole Survivor NaN
158 ENTP 14 39 20 Sole Survivor NaN
138 ENTJ 14 39 20 Sole Survivor NaN
120 ESFP 14 39 18 Sole Survivor NaN
100 INFJ 14 39 20 Sole Survivor NaN
80 ENTP 14 39 20 Sole Survivor NaN
60 ENTP 14 39 20 Sole Survivor NaN
40 ENFJ 14 39 20 Sole Survivor NaN
18 ESTP 15 39 22 Sole Survivor NaN
0 ENFP 13 26 18 Sole Survivor NaN
original_tribe swapped_tribe swapped_tribe_2 merged_tribe \
746 Tagi NaN NaN Rattana
730 Ogakor NaN NaN Barramundi
714 Boran Boran NaN Moto Maji
698 Maraamu Rotu NaN Soliantu
682 Chuay Gahn NaN NaN Chuay Jai
666 Jaburu Jaburu NaN Jacaré
648 Drake Drake NaN Balboa
630 Chapera Chapera Chapera Chaboga Mogo
612 Lopevi Lopevi NaN Alinta
592 Koror NaN NaN Koror
574 Nakúm Yaxhá NaN Xhakúm
558 Viveros Casaya NaN Gitanos
538 Puka Puka Aitutaki Aitutaki Aitutonga
519 Ravu Moto NaN Bula Bula
503 Fei Long Fei Long NaN Hae Da Fung
483 Malakal Airai NaN Dabu
465 Kota Kota Kota Nobag
449 Jalapao NaN NaN Forza
429 Foa Foa NaN NaN Aiga
409 Villains NaN NaN Yin Yang
389 La Flor La Flor NaN Libertad
369 Ometepe NaN NaN Murlonio
349 Upolu NaN NaN Te Tuna
331 Salani Salani NaN Tikiano
313 Matsing Kalabaw NaN Dangrayne
293 Bikal Bikal NaN Enil Edam
270 Galang Tadhana NaN Kasama
252 Aparri Solana NaN Solarrion
234 Hunahpu Hunahpu NaN Huyopa
216 Escameca Escameca NaN Merica
196 Bayon Bayon Bayon Orkun
178 Gondol Chan Loh NaN Dara
158 Vanua Takali NaN Vinaka
138 Nuku Tavua NaN Maku Maku
120 Levu Yawa NaN Solewa
100 Naviti Naviti Yanuya Lavita
80 David Jabeni NaN Kalokalo
60 Manu NaN NaN Vata
40 Vokai Vokai NaN Lumuwaku
18 Dakal Dakal NaN Koru
0 Luvu NaN NaN Via Kana
total_votes_received immunity_idols_won
746 6 4
730 0 2
714 0 4
698 2 4
682 0 8
666 3 7
648 0 3
630 6 6
612 3 6
592 0 12
574 1 5
558 9 4
538 5 6
519 1 5
503 5 4
483 4 7
465 2 8
449 0 5
429 8 1
409 3 4
389 2 8
369 7 9
349 5 6
331 3 6
313 6 1
293 0 7
270 2 8
252 5 6
234 0 6
216 4 7
196 3 7
178 1 7
158 6 6
138 0 4
120 11 4
100 5 8
80 0 4
60 9 1
40 2 6
18 0 9
0 2 8
Q6: Have any contestants won more than one time? If so, select their records from the sole_survivor DataFrame, sorting the rows by season. Save this as Q6. If no contestant has won twice, save Q6 as the string None.
sole_survivor.groupby('full_name', as_index=False)['season'].count().sort_values(by = 'season', ascending = False)
Q6 = sole_survivor[(sole_survivor['full_name'] == 'Sandra Diaz-Twine') | (sole_survivor['full_name'] == 'Tony Vlachos')].sort_values(by = 'season')
print(Q6)
season_name season full_name castaway_id \
648 Survivor: Pearl Islands 7 Sandra Diaz-Twine 112
409 Survivor: Heroes vs. Villains 20 Sandra Diaz-Twine 112
252 Survivor: Cagayan 28 Tony Vlachos 424
18 Survivor: Winners at War 40 Tony Vlachos 424
castaway age city state personality_type episode \
648 Sandra 29 Fort Lewis Washington ESTP 15
409 Sandra 35 Fayetteville North Carolina ESTP 15
252 Tony 39 Jersey City New Jersey ESTP 14
18 Tony 45 Allendale New Jersey ESTP 15
day order result jury_status original_tribe swapped_tribe \
648 39 18 Sole Survivor NaN Drake Drake
409 39 20 Sole Survivor NaN Villains NaN
252 39 18 Sole Survivor NaN Aparri Solana
18 39 22 Sole Survivor NaN Dakal Dakal
swapped_tribe_2 merged_tribe total_votes_received immunity_idols_won
648 NaN Balboa 0 3
409 NaN Yin Yang 3 4
252 NaN Solarrion 5 6
18 NaN Koru 0 9
Q7: Using value_counts(), what is the normalized relative frequencies (percentage) breakdown of gender for all the contestants that have played before? Count someone who played in multiple seasons only once. Round the results to 3 decimal places.
Q7 = castaway_details[castaway_details['full_name'] == castaway_details['full_name'].unique()]
Q7 = round(Q7['gender'].value_counts(normalize='True'),3)
print(Q7)
Male 0.502 Female 0.497 Non-binary 0.002 Name: gender, dtype: float64
Q8:
Q8A. Q8B. 55.57).combined4 = castaway_details.merge(sole_survivor, how='inner', left_on= 'castaway_id', right_on='castaway_id')
combined4['gender'].value_counts()
Male 25 Female 16 Name: gender, dtype: int64
Q8A = round(25/41*100,2)
Q8B = round(16/41*100,2)
print(Q8A,Q8B)
60.98 39.02
Q9: What is the average age of contestants when they appeared on the show? Save this as Q9. Round to nearest integer.
Q9 = round(castaways['age'].mean(),0)
Q9 = int(Q9)
print(Q9)
33
Q10: Let's say we wanted to analyze the types of occupations that make a good winner on Survivor. Create a DataFrame that includes the occupation of every winner (sorted by season). If a contestant won more than one time, the occupation should only appear once for the first time the contestant was on the show. The DataFrame index will be the respective season number and should be called Q10.
Code Check: The first five rows of the Q10 DataFrame should look similar to this:
| season | occupation |
|---|---|
| 1 | Corporate Trainer |
| 2 | Personal Nurse; Motivational Speaker |
| 3 | Professional Soccer Player;Social Entrepreneur... |
| 4 | Office Manager |
| 5 | Used Car Salesman |
Q10 = combined4.groupby('castaway_id',as_index = False)[['season','occupation']].min()
Q10 = Q10[['season','occupation']].sort_values(by='season').reset_index(drop=True)
print(Q10)
season occupation 0 1 Corporate Trainer 1 2 Personal Nurse;Motivational Speaker 2 3 Professional Soccer Player;Social Entrepreneur... 3 4 Office Manager 4 5 Used Car Salesman 5 6 Swimsuit Model 6 7 Office Assistant;Case Manager 7 8 Administrative Assistant;Director of Marketing... 8 9 Highway Construction Worker 9 10 NYC Firefighter;Motivational Speaker 10 11 Sports Radio Host;Owner of Sideline Chic 11 12 Yoga Instructor;Musician 12 13 Management Consultant;Product Management 13 14 Ad Executive 14 15 Flight Attendant 15 16 Boxer;Charity Organizer;Yoga Teacher/Life Coac... 16 17 Physics Teacher 17 18 Cattle Rancher 18 19 Pharmaceutical Sales 19 21 Student 20 22 Construction Worker 21 23 Medical Student;Healthcare Consultant 22 24 Bridal Shop Owner;Interior Designer 23 25 Sex Therapist 24 26 Harvard Law Student 25 27 Former Professional Cyclist/Missionary;Shop Ma... 26 28 Police Officer 27 29 Crossfit Coach/Physical Therapy Student;CrossF... 28 30 Oil Driller 29 31 Cambridge Firefighter 30 32 Bartender;Business Development Manager 31 33 Homeless Shelter Manager;Keynote Speaker and Host 32 34 Police Officer 33 35 Marine;Real Estate/Stay-at-home Dad 34 36 Furniture Company Owner;Furniture Designer 35 37 Public Defender;Attorney 36 38 District Sales Manager 37 39 4th Grade Teacher 38 41 Communications Manager
Q11: Who played the most total number of days of Survivor? If a contestant appeared on more than one season, you would add their total days for each season together. Save the top five contestants in terms of total days played as a DataFrame and call it Q11, sorted in descending order by total days played.
The following columns should be included: castaway_id, full_name, and total_days_played where total_days_played is the sum of all days a contestant played. The index should go from 0 to n-1.
Note: Be careful because on some seasons, the contestant was allowed to come back into the game after being voted off. Take a look at Season 23's contestant Oscar Lusth in the castaways DataFrame as an example. He was voted out 7th and then returned to the game. He was then voted out 9th and returned to the game a second time. He was then voted out 17th the final time. Be aware of this in your calculations and make sure you are counting the days according to the last time they were voted off or won.
Q11 = castaways.groupby(['season','castaway_id','full_name'],as_index=False)['day'].max().sort_values(by='day',ascending=False)
Q11 = Q11.groupby(['castaway_id','full_name'],as_index=False)['day'].sum().sort_values(by='day',ascending=False)
Q11 = Q11.rename(columns={'day':'total_days_played'})
print(Q11)
castaway_id full_name total_days_played 55 55 Rob Mariano 131 198 197 Parvati Shallow 130 202 201 Oscar Lusth 128 179 179 Cirie Fields 121 112 112 Sandra Diaz-Twine 110 .. ... ... ... 133 133 Jolanda Jones 3 0 1 Sonja Christopher 3 131 131 Jonathan Libby 2 132 132 Wanda Shirk 2 195 195 Candice Cody 1 [611 rows x 3 columns]
Q12A & Q12B: What is the percentage of total extroverts and introverts that have played the game (count players only once even if they have played in more than one season). Save these percentages as Q12A and Q12B respectively. Note: Round all percentages to two decimal points and write as a float (example: 55.57).
For more information on personality types check this Wikipedia article.
total = combined1.groupby(['castaway_id','personality_type_x'],as_index=False)['season'].sum()
count_total = total['castaway_id'].count()
extroverts = total[total['personality_type_x'].str.contains('E')]
count_extroverts = extroverts['personality_type_x'].count()
Q12A = round(count_extroverts/count_total*100,2)
print(Q12A)
53.63
Q12B = 100 - Q12A
print(Q12B)
46.37
What is the percentage of total extroverts and introverts that have won the game (count players only once even if they have won more than one season)? Save these percentages as Q13A and Q13B respectively. Note: Round all percentages to two decimal points and write as a float (example: 55.57).
count_total = combined4['castaway_id'].nunique()
extroverts = combined4[combined4['personality_type_x'].str.contains('E')]
count_extroverts = extroverts['castaway_id'].nunique()
Q13A = round(count_extroverts/count_total*100,2)
print(Q13A)
61.54
Q13B = 100 - Q13A
print(Q13B)
38.46
Q14: Which contestants have never received a tribal council vote (i.e. a vote to be voted out of the game as shown in the vote_id column in the vote_history DataFrame)? Note that there are various reasons for a contestant to not receive a tribal vote: they quit, made it to the end, medical emergency, etc. Select their rows from the castaway_details DataFrame and save this as Q14 in ascending order by castaway_id. This should return a DataFrame and the index and missing values should be left as is.
not_voted = castaways.groupby('castaway_id',as_index=False)['total_votes_received'].sum()
not_voted = not_voted[not_voted['total_votes_received'] == 0]
not_voted_list = not_voted['castaway_id'].to_list()
Q14 = castaway_details[castaway_details['castaway_id'].isin(not_voted_list)]
print(Q14)
castaway_id full_name short_name date_of_birth date_of_death \
21 22 Michael Skupin Michael 1962-01-29 NaT
60 61 Paschal English Paschal 1945-03-05 NaT
79 80 Brian Heidik Brian 1968-03-09 NaT
130 131 Jonathan Libby Jonathan 1981-09-05 NaT
131 132 Wanda Shirk Wanda 1949-08-24 NaT
205 206 Gary Stritesky Gary 1951-09-16 NaT
241 242 Kathleen Sleckman Kathy 1962-08-14 NaT
282 283 Mike Borassi Mike 1947-03-13 NaT
313 314 Kelly Shinn Purple Kelly 1990-05-05 NaT
353 354 Kourtney Moon Kourtney 1982-02-27 NaT
374 375 Dana Lambert Dana 1979-12-13 NaT
384 385 Lisa Whelchel Lisa 1963-05-29 NaT
411 412 Lindsey Ogle Lindsey 1983-09-04 NaT
467 468 Neal Gottlieb Neal 1977-02-03 NaT
536 537 Pat Cusack Pat 1977-02-25 NaT
539 540 Bi Nguyen Bi 1989-10-30 NaT
552 553 Kara Kay Kara 1987-11-20 NaT
569 570 Gavin Whitson Gavin 1994-07-20 NaT
gender race ethnicity \
21 Male NaN NaN
60 Male NaN NaN
79 Male NaN NaN
130 Male NaN NaN
131 Female NaN NaN
205 Male NaN NaN
241 Female NaN NaN
282 Male NaN NaN
313 Female NaN NaN
353 Female NaN NaN
374 Female NaN NaN
384 Female NaN NaN
411 Female NaN NaN
467 Male NaN NaN
536 Male NaN NaN
539 Female Asian NaN
552 Female NaN NaN
569 Male NaN NaN
occupation personality_type
21 Software Publisher;Part-Time Professional Speaker ESFJ
60 Judge ISFJ
79 Used Car Salesman ISTP
130 Sales & Marketing Associate ISTP
131 English Teacher ENFP
205 School Bus Driver ISFJ
241 Golf Course Vendor ENFP
282 Personal Chef ESTP
313 Nursing Student ENFP
353 Motorcycle Repair ISFP
374 Cosmetologist ISTP
384 Former TV Teen Star INFP
411 Hairstylist ESFP
467 Ice Cream Entrepreneur INTP
536 Maintenance Manager ESTP
539 MMA Fighter ISFP
552 Realtor ENFJ
569 YMCA Program Director ISFJ
Q15: What contestant has won the most number of challenges? Select their row from the castaway_details DataFrame and save this as Q15. This should return a DataFrame and the index and missing values should be left as is.
x = challenge_results.groupby(['winner_id','winner'],as_index = False)['winning_tribe'].count().sort_values(by = 'winning_tribe',ascending=False)
combined6 = tribe_mapping.merge(x, how='right', left_on= 'castaway', right_on='winner')
most_wins = combined6.groupby('castaway_id',as_index=False)['winning_tribe'].max().sort_values(by='winning_tribe',ascending=False).head(1)
most_wins = most_wins['castaway_id'].to_list()
Q15 = castaway_details[castaway_details['castaway_id'].isin(most_wins)]
print(Q15)
castaway_id full_name short_name date_of_birth date_of_death gender \
200 201 Oscar Lusth Ozzy 1981-08-23 NaT Male
race ethnicity occupation \
200 Mexican American Hispanic or Latino Waiter;Photographer
personality_type
200 ISFP
Q16: What challenge has been played the most number of times in all seasons? To make it easier, a challenge should only count one time for each season it was played even if it was played more than once during a season (although I don't think that is the case for any of these challenges). Select their row from the challenge_description DataFrame and save this as Q16. This should return a DataFrame and the index and missing values should be left as is.
challenge_results['challenge_id'].value_counts()
CH0631 157
CH0004 84
CH0388 75
CH0765 73
CH0430 70
...
CH0462 1
CH0764 1
CH0695 1
CH0361 1
CH0606 1
Name: challenge_id, Length: 506, dtype: int64
Q16 = challenge_description[challenge_description['challenge_id'] == 'CH0631']
print(Q16)
challenge_id challenge_name puzzle race precision endurance \
548 CH0631 Blind Leading the Blind False True False False
strength turn_based balance food knowledge memory fire water
548 False False False False False False False False
Q17: Let's see if the use of hidden immunity idols has increased or decreased over the seasons. Create a Series of the number of hidden idols held per season. The season number should be the index and the values should be the sum of the number of idols that were held. Save this as Q17, sorted by season in ascending order.
Q17 = hidden_idols.notna()
Q17 = hidden_idols.groupby('season')['idols_held'].sum()
print(Q17)
season 11 1 12 1 13 1 14 4 15 3 16 4 17 4 18 3 19 4 20 11 21 4 22 3 23 2 24 3 25 3 26 7 27 3 28 6 29 5 30 3 31 4 32 5 33 7 34 8 35 10 36 9 37 7 38 8 39 13 40 10 Name: idols_held, dtype: int64
Q18: Which contestant held the most number of hidden immunity idols in a single season? Select their row from the castaway_details DataFrame and save this as Q18. This should return a DataFrame and the index and missing values should be left as is.
hidden_idols.groupby(['season','castaway_id'],as_index = False)['idols_held'].sum().sort_values(by = 'idols_held',ascending = False).head(1)
| season | castaway_id | idols_held | |
|---|---|---|---|
| 102 | 38 | 560 | 4 |
Q18 = castaway_details[castaway_details['castaway_id'] == 560]
print(Q18)
castaway_id full_name short_name date_of_birth date_of_death gender \
559 560 Rick Devens Rick 1984-04-05 NaT Male
race ethnicity occupation personality_type
559 NaN NaN Morning News Anchor ENTP
Q19: What was the largest number of days between when a hidden immunity idol was found and played. Don't count instances with missing values in days found or the days played column. Save the largest number of days as Q19 (as an int).
highest_days = hidden_idols[hidden_idols['day_found'].notna()]
highest_days = hidden_idols[hidden_idols['day_played'].notna()]
highest_days = highest_days.copy()
highest_days['difference'] = hidden_idols['day_played'] - hidden_idols['day_found']
Q19 = highest_days['difference'].max()
Q19 = int(Q19)
print(Q19)
33
Q20: Let's find out which finalist received zero votes from the jury (remember the jury votes are good -- you want jury votes to win the game but these players did not receive any votes).
jury_votes DataFrame grouped by season, then finalist id. Q20.votes = jury_votes.groupby(['season','finalist_id'],as_index=False)['vote'].sum()
votes = votes[votes['vote'] == 0]
Q20 = votes.groupby(['season','finalist_id'])['vote'].sum()
print(Q20)
season finalist_id
13 200 0
14 219 0
220 0
17 263 0
18 280 0
19 299 0
20 300 0
21 319 0
22 336 0
23 352 0
24 369 0
26 346 0
396 0
27 10 0
31 419 0
421 0
32 476 0
33 496 0
497 0
34 364 0
36 534 0
37 554 0
38 569 0
39 588 0
40 478 0
41 597 0
Name: vote, dtype: int64
Q21: Let's see how many winners ended up getting unanimous jury votes to win the game. Create a Dataframe that shows the survivors that got unanimous jury votes with these columns in the final output: season, season_name, winner_id, full_name. The DataFrame should be sorted by season and the index should go from 0 to n-1. Save this as Q21.
unanimous_list = ['10-0-0','7-0','8-0-0','9-0-0']
Q21 = season_summary[season_summary['final_vote'].isin(unanimous_list)]
Q21 = Q21[['season','season_name','winner_id','full_name']].reset_index(drop=True)
print(Q21)
season season_name winner_id full_name 0 14 Survivor: Fiji 221 Earl Cole 1 18 Survivor: Tocantins 281 James Thomas Jr. 2 26 Survivor: Caramoan 348 John Cochran 3 31 Survivor: Cambodia 433 Jeremy Collins 4 33 Survivor: Millennials vs. Gen X 498 Adam Klein
Q22: Sometimes a contestant might win the game even though they have a lot of other contestants trying to eliminate them. What survivor that won their season had the most votes against them to get voted out during the season (represented as "total_votes_received" from the sole_survivor DataFrame). Select their row from the castaway_details DataFrame and save this as Q22. This should return a DataFrame and the index and missing values should be left as is.
### ENTER CODE HERE ###
combined4.groupby('castaway_id',as_index=False)['total_votes_received'].max().sort_values(by='total_votes_received',ascending=False).head(1)
| castaway_id | total_votes_received | |
|---|---|---|
| 33 | 516 | 11 |
Q22 = castaway_details[castaway_details['castaway_id'] == 516]
print(Q22)
castaway_id full_name short_name date_of_birth date_of_death \
515 516 Ben Driebergen Ben 1983-01-01 NaT
gender race ethnicity occupation \
515 Male NaN NaN Marine;Real Estate/Stay-at-home Dad
personality_type
515 ESFP
Q23: Let's see how many times each country was used as a location. Create a Series where the country name is the index and the total number of times a season was played in that country are the values. Sort in descending order and call this Q23.
Q23 = season_summary.groupby('country')['season'].count()
print(Q23)
country Australia 1 Brazil 2 Cambodia 2 China 1 Fiji 10 Gabon 1 Guatemala 1 Islands 1 Kenya 1 Malaysia 1 Nicaragua 6 Palau 2 Panama 3 Philippines 4 Polynesia 1 Samoa 2 Thailand 1 Vanuatu 1 Name: season, dtype: int64
Q24: For the castaway_details DataFrame, there is a full_name column and a short_name column. It would be helpful for future analysis to have the contestants first and last name split into separate columns. First copy the castaway_details DataFrame to a new DataFrame called Q24 so that we do not change the original DataFrame.
Create two new columns and add the contestant's first name to a new column called first_name and their last name to a new column called last_name. Add these columns to the Q24 DataFrame. Put the first_name and last_name columns between the full_name and short_name columns.
Note: Be careful as some players have last names with multiple spaces. For example, Lex van den Berghe. You should code Lex as his first name and van den Berghe as his last name.
Q24 = castaway_details.copy()
Q24[['first_name', 'last_name']] = Q24['full_name'].str.split(" ", 1, expand=True)
Q24 = Q24[['castaway_id', 'full_name', 'first_name', 'last_name', 'short_name','date_of_birth','date_of_death','gender','race'
,'ethnicity','occupation','personality_type']]
print(Q24)
castaway_id full_name first_name last_name short_name \
0 1 Sonja Christopher Sonja Christopher Sonja
1 2 B.B. Anderson B.B. Anderson B.B.
2 3 Stacey Stillman Stacey Stillman Stacey
3 4 Ramona Gray Ramona Gray Ramona
4 5 Dirk Been Dirk Been Dirk
.. ... ... ... ... ...
603 604 Tiffany Seely Tiffany Seely Tiffany
604 605 Sydney Segal Sydney Segal Sydney
605 606 Shantel Smith Shantel Smith Shan
606 607 David Voce David Voce Voce
607 608 Liana Wallace Liana Wallace Liana
date_of_birth date_of_death gender race ethnicity \
0 1937-01-28 NaT Female NaN NaN
1 1936-01-18 2013-10-29 Male NaN NaN
2 1972-08-11 NaT Female NaN NaN
3 1971-01-20 NaT Female Black NaN
4 1976-06-15 NaT Male NaN NaN
.. ... ... ... ... ...
603 1973-12-08 NaT Female White Jewish
604 1995-07-19 NaT Female White Jewish
605 1987-03-11 NaT Female Black NaN
606 1986-05-01 NaT Male NaN NaN
607 2000-10-25 NaT Female Black Jewish
occupation personality_type
0 Musician ENFP
1 Real Estate Developer ESTJ
2 Attorney ENTJ
3 Biochemist/Chemist ISTJ
4 Dairy Farmer ISFP
.. ... ...
603 Teacher ENTP
604 Law Student ESTP
605 Pastor ENFJ
606 Neurosurgeon ENTJ
607 College Student ESTJ
[608 rows x 12 columns]
Q24[Q24['full_name'] == 'Lex van den Berghe']
| castaway_id | full_name | first_name | last_name | short_name | date_of_birth | date_of_death | gender | race | ethnicity | occupation | personality_type | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 45 | 46 | Lex van den Berghe | Lex | van den Berghe | Lex | 1963-06-18 | NaT | Male | NaN | NaN | Marketing Manager | ENTJ |
Q25: Let's say that we have a theory that contestants that stay in their original tribe when the tribes are swapped have a better chance at winning their season.
sole_survivor DataFrame that you created earlier, drop any castaways that have a missing value in the "swapped tribe" column. Use a copy of the data so that you do not drop rows from the original DataFrame.Q25A.Q25B.Note: In reality, it is more complicated than this as you would really want to see how many contestants ended up with the majority of their original tribe on the swapped tribe regardless of whether their tribe stayed the same. You are welcome to research this on your own for further practice.
Remember: Don't change any of the original DataFrames or CodeGrade will not work correctly for this assignment. Make sure you use copy() if needed.
cond = sole_survivor.copy()
cond = cond[cond['swapped_tribe'].notna()]
Q25A = cond[cond['swapped_tribe'] == cond['original_tribe']]
print(Q25A)
season_name season full_name castaway_id \
714 Survivor: Africa 3 Ethan Zohn 48
666 Survivor: The Amazon 6 Jenna Morasca 96
648 Survivor: Pearl Islands 7 Sandra Diaz-Twine 112
630 Survivor: All-Stars 8 Amber Brkich 27
612 Survivor: Vanuatu 9 Chris Daugherty 130
503 Survivor: China 15 Todd Herzog 237
465 Survivor: Gabon 17 Robert Crowley 265
389 Survivor: Nicaragua 21 Jud Birza 321
331 Survivor: One World 24 Kim Spradlin 371
293 Survivor: Caramoan 26 John Cochran 348
234 Survivor: San Juan del Sur 29 Natalie Anderson 442
216 Survivor: Worlds Apart 30 Mike Holloway 460
196 Survivor: Cambodia 31 Jeremy Collins 433
100 Survivor: Ghost Island 36 Wendell Holland 536
40 Survivor: Island of the Idols 39 Tommy Sheehan 590
18 Survivor: Winners at War 40 Tony Vlachos 424
castaway age city state personality_type \
714 Ethan 27 Lexington Massachusetts ISFP
666 Jenna 21 Bridgeville Pennsylvania ISTP
648 Sandra 29 Fort Lewis Washington ESTP
630 Amber 25 Beaver Pennsylvania ISFP
612 Chris 33 South Vienna Ohio ENTP
503 Todd 22 Pleasant Grove Utah ENFP
465 Bob 57 South Portland Maine INTP
389 Fabio 21 Venice California ESFP
331 Kim 29 San Antonio Texas INFJ
293 Cochran 25 Washington D.C. INTP
234 Natalie 28 Edgewater New Jersey ESTP
216 Mike 38 North Richland Hills Texas ESTP
196 Jeremy 37 Foxboro Massachusetts ESTJ
100 Wendell 33 Philadelphia Pennsylvania INFJ
40 Tommy 26 Long Beach New York ENFJ
18 Tony 45 Allendale New Jersey ESTP
episode day order result jury_status original_tribe \
714 15 39 16 Sole Survivor NaN Boran
666 15 39 16 Sole Survivor NaN Jaburu
648 15 39 18 Sole Survivor NaN Drake
630 17 39 18 Sole Survivor NaN Chapera
612 15 39 18 Sole Survivor NaN Lopevi
503 15 39 16 Sole Survivor NaN Fei Long
465 14 39 18 Sole Survivor NaN Kota
389 16 39 20 Sole Survivor NaN La Flor
331 15 39 18 Sole Survivor NaN Salani
293 15 39 20 Sole Survivor NaN Bikal
234 15 39 18 Sole Survivor NaN Hunahpu
216 15 39 18 Sole Survivor NaN Escameca
196 15 39 20 Sole Survivor NaN Bayon
100 14 39 20 Sole Survivor NaN Naviti
40 14 39 20 Sole Survivor NaN Vokai
18 15 39 22 Sole Survivor NaN Dakal
swapped_tribe swapped_tribe_2 merged_tribe total_votes_received \
714 Boran NaN Moto Maji 0
666 Jaburu NaN Jacaré 3
648 Drake NaN Balboa 0
630 Chapera Chapera Chaboga Mogo 6
612 Lopevi NaN Alinta 3
503 Fei Long NaN Hae Da Fung 5
465 Kota Kota Nobag 2
389 La Flor NaN Libertad 2
331 Salani NaN Tikiano 3
293 Bikal NaN Enil Edam 0
234 Hunahpu NaN Huyopa 0
216 Escameca NaN Merica 4
196 Bayon Bayon Orkun 3
100 Naviti Yanuya Lavita 5
40 Vokai NaN Lumuwaku 2
18 Dakal NaN Koru 0
immunity_idols_won
714 4
666 7
648 3
630 6
612 6
503 4
465 8
389 8
331 6
293 7
234 6
216 7
196 7
100 8
40 6
18 9
Q25B = cond[cond['swapped_tribe'] != cond['original_tribe']]
print(Q25B)
season_name season full_name \
698 Survivor: Marquesas 4 Vecepia Towery
574 Survivor: Guatemala 11 Danni Boatwright
558 Survivor: Panama 12 Aras Baskauskas
538 Survivor: Cook Islands 13 Yul Kwon
519 Survivor: Fiji 14 Earl Cole
483 Survivor: Micronesia 16 Parvati Shallow
313 Survivor: Philippines 25 Denise Stapley
270 Survivor: Blood vs. Water 27 Tyson Apostol
252 Survivor: Cagayan 28 Tony Vlachos
178 Survivor: Kaoh Rong 32 Michele Fitzgerald
158 Survivor: Millennials vs. Gen X 33 Adam Klein
138 Survivor: Game Changers 34 Sarah Lacina
120 Survivor: Heroes vs. Healers vs. Hustlers 35 Ben Driebergen
80 Survivor: David vs. Goliath 37 Nick Wilson
castaway_id castaway age city state personality_type \
698 64 Vecepia 36 Hayward California ISTJ
574 166 Danni 29 Tonganoxie Kansas ENFJ
558 182 Aras 24 Santa Monica California INFP
538 202 Yul 31 San Mateo California INTJ
519 221 Earl 35 Santa Monica California INFJ
483 197 Parvati 25 Los Angeles California ENFJ
313 386 Denise 41 Cedar Rapids Iowa INFJ
270 274 Tyson 34 Provo Utah ESTP
252 424 Tony 39 Jersey City New Jersey ESTP
178 478 Michele 24 Freehold New Jersey ESFP
158 498 Adam 25 San Francisco California ENTP
138 414 Sarah 32 Marion Iowa ENTJ
120 516 Ben 34 Boise Idaho ESFP
80 556 Nick 27 Williamsburg Kentucky ENTP
episode day order result jury_status original_tribe \
698 15 39 16 Sole Survivor NaN Maraamu
574 15 39 18 Sole Survivor NaN Nakúm
558 16 39 16 Sole Survivor NaN Viveros
538 16 39 20 Sole Survivor NaN Puka Puka
519 15 39 19 Sole Survivor NaN Ravu
483 15 39 20 Sole Survivor NaN Malakal
313 15 39 18 Sole Survivor NaN Matsing
270 15 39 23 Sole Survivor NaN Galang
252 14 39 18 Sole Survivor NaN Aparri
178 15 39 18 Sole Survivor NaN Gondol
158 14 39 20 Sole Survivor NaN Vanua
138 14 39 20 Sole Survivor NaN Nuku
120 14 39 18 Sole Survivor NaN Levu
80 14 39 20 Sole Survivor NaN David
swapped_tribe swapped_tribe_2 merged_tribe total_votes_received \
698 Rotu NaN Soliantu 2
574 Yaxhá NaN Xhakúm 1
558 Casaya NaN Gitanos 9
538 Aitutaki Aitutaki Aitutonga 5
519 Moto NaN Bula Bula 1
483 Airai NaN Dabu 4
313 Kalabaw NaN Dangrayne 6
270 Tadhana NaN Kasama 2
252 Solana NaN Solarrion 5
178 Chan Loh NaN Dara 1
158 Takali NaN Vinaka 6
138 Tavua NaN Maku Maku 0
120 Yawa NaN Solewa 11
80 Jabeni NaN Kalokalo 0
immunity_idols_won
698 4
574 5
558 4
538 6
519 5
483 7
313 1
270 8
252 6
178 7
158 6
138 4
120 4
80 4
Q26: Let's say that we wanted to predict a contestants personality type based on the information in the data files. Your task is to create a DataFrame that lists the castaway_id, full_name and personality_type for each castaway contestant. However, since most machine learning algorithms use numeric data, you want to change the personality types to the following numbers:
Save this new DataFrame as Q26 and sort based on castaway_id in ascending order.
Q26 = castaway_details.copy()
Q26 = Q26[['castaway_id','full_name','personality_type']]
Q26["personality_type"].replace({"ISTJ": 1, "ISTP": 2, "ISTP": 2, "ISFJ": 3, "ISFP": 4, "INFJ": 5, "INFP": 6,
"INTJ": 7, "INTP": 8, "ESTP": 9, "ESTJ": 10, "ESFP": 11, "ESFJ": 12, "ENFP": 13,
"ENFJ": 14, "ENTP": 15, "ENTJ": 16}, inplace=True)
Q26['personality_type'] = Q26['personality_type'].fillna(17)
Q26['personality_type'] = Q26['personality_type'].astype('int')
print(Q26)
castaway_id full_name personality_type 0 1 Sonja Christopher 13 1 2 B.B. Anderson 10 2 3 Stacey Stillman 16 3 4 Ramona Gray 1 4 5 Dirk Been 4 .. ... ... ... 603 604 Tiffany Seely 15 604 605 Sydney Segal 9 605 606 Shantel Smith 14 606 607 David Voce 16 607 608 Liana Wallace 10 [608 rows x 3 columns]
Q27: After thinking about this some more, you realize that you don't want to code the personality traits as you did in problem 26 since the data is not ordinal. Some machine learning algorithms will assume that numbers close to each other are more alike than those that are away from each other and that is not the case with these personality types.
Let's create a new DataFrame called Q27 that creates dummy columns (using get_dummies) for the original personality type column. Add a prefix called "type" and drop the first column to help prevent multicollinearity. The columns should be castaway_id, full_name followed by the various dummy columns for the personality types. Don't worry about any missing values in this step.
Remember: Don't change any of the original DataFrames or CodeGrade will not work correctly for this assignment. Make sure you use copy() if needed.
Q27 = castaway_details.copy()
Q27 = Q27[['castaway_id','full_name','personality_type']]
Q27 = pd.get_dummies(data=Q27,columns=['personality_type'], prefix='type',drop_first = True)
print(Q27)
castaway_id full_name type_ENFP type_ENTJ type_ENTP \
0 1 Sonja Christopher 1 0 0
1 2 B.B. Anderson 0 0 0
2 3 Stacey Stillman 0 1 0
3 4 Ramona Gray 0 0 0
4 5 Dirk Been 0 0 0
.. ... ... ... ... ...
603 604 Tiffany Seely 0 0 1
604 605 Sydney Segal 0 0 0
605 606 Shantel Smith 0 0 0
606 607 David Voce 0 1 0
607 608 Liana Wallace 0 0 0
type_ESFJ type_ESFP type_ESTJ type_ESTP type_INFJ type_INFP \
0 0 0 0 0 0 0
1 0 0 1 0 0 0
2 0 0 0 0 0 0
3 0 0 0 0 0 0
4 0 0 0 0 0 0
.. ... ... ... ... ... ...
603 0 0 0 0 0 0
604 0 0 0 1 0 0
605 0 0 0 0 0 0
606 0 0 0 0 0 0
607 0 0 1 0 0 0
type_INTJ type_INTP type_ISFJ type_ISFP type_ISTJ type_ISTP
0 0 0 0 0 0 0
1 0 0 0 0 0 0
2 0 0 0 0 0 0
3 0 0 0 0 1 0
4 0 0 0 1 0 0
.. ... ... ... ... ... ...
603 0 0 0 0 0 0
604 0 0 0 0 0 0
605 0 0 0 0 0 0
606 0 0 0 0 0 0
607 0 0 0 0 0 0
[608 rows x 17 columns]
Q28: After running your data above through your machine learning model, you determine that a better prediction might come from breaking the personality type into its four parts (one part for each character in the type). Your task is now to create a DataFrame called Q28 that splits the personality type into the various parts and creates a new column for each part (these columns should be called interaction that will represent the first letter in the personality type, information for the second letter, decision for the third, and organization for the fourth).
Again, since most machine learning algorithms work with numeric data, perform the following on the four new columns:
interaction --> code all I's as 0 and E's as 1information --> code all S's as 0 and N's as 1decision --> code all T's as 0 and F's as 1organization --> code as J's with 0 and P's as 12For example, if a contestant's personality type was ENTJ, your columns for that row would be:
1 for interaction because of the E1 for information because of the N0 for decision because of the T 0 for organization because of the JThe new DataFrame should be sorted in castaway_id order and have the following columns in this order: castaway_id, full_name, personality_type, interaction, information, decision, organization.
Remember: Don't change any of the original DataFrames or CodeGrade will not work correctly for this assignment. Make sure you use copy() if needed.
Q28 = castaway_details.copy()
Q28 = Q28[['castaway_id','full_name','personality_type']]
Q28[['0','interaction', 'information', 'decision', 'organization','1']] = Q28['personality_type'].str.split('', expand=True)
Q28 = Q28.drop(['0','1'], axis = 1)
Q28["interaction"].replace({"E": '1', "I": '0'}, inplace=True)
Q28["information"].replace({"N": '1', "S": '0'}, inplace=True)
Q28["decision"].replace({"F": '1', "T": '0'}, inplace=True)
Q28["organization"].replace({"P": '1', "J": '0'}, inplace=True)
print(Q28)
castaway_id full_name personality_type interaction information \
0 1 Sonja Christopher ENFP 1 1
1 2 B.B. Anderson ESTJ 1 0
2 3 Stacey Stillman ENTJ 1 1
3 4 Ramona Gray ISTJ 0 0
4 5 Dirk Been ISFP 0 0
.. ... ... ... ... ...
603 604 Tiffany Seely ENTP 1 1
604 605 Sydney Segal ESTP 1 0
605 606 Shantel Smith ENFJ 1 1
606 607 David Voce ENTJ 1 1
607 608 Liana Wallace ESTJ 1 0
decision organization
0 1 1
1 0 0
2 0 0
3 0 0
4 1 1
.. ... ...
603 0 1
604 0 1
605 1 0
606 0 0
607 0 0
[608 rows x 7 columns]
Q29: Using data from castaways, create a DataFrame called Q29 that bins the contestant ages (their age when they were on the season, not their current age) into the following age categories:
The final DataFrame should have the following columns in this order: season, castaway_id, full_name, age, and age_category. The DataFrame should be sorted by age and then castaway_id. The index should be 0 through n-1. You should have the same amount of rows as in the castaways DataFrame.
Remember: Don't change any of the original DataFrames or CodeGrade will not work correctly for this assignment. Make sure you use copy() if needed.
bins = [17,24,34,44,54,64,100]
group_names = ["18-24",'25-34','35-44','45-54','55-64','65+']
Q29 = castaways.copy()
Q29['age_category'] = pd.cut(Q29['age'], bins, labels = group_names)
Q29 = Q29[['season','castaway_id','full_name','age','age_category']].sort_values(by = ['age','castaway_id']).reset_index(drop = True)
print(Q29)
season castaway_id full_name age age_category 0 33 491 Will Wahl 18 18-24 1 36 528 Michael Yerger 18 18-24 2 18 270 Spencer Duhm 19 18-24 3 22 336 Natalie Tenerelli 19 18-24 4 23 350 Brandon Hantz 19 18-24 .. ... ... ... ... ... 757 24 366 Greg Smith 64 55-64 758 21 304 Jimmy Johnson 67 65+ 759 32 474 Joseph Del Campo 71 65+ 760 1 14 Rudy Boesch 72 65+ 761 8 14 Rudy Boesch 75 65+ [762 rows x 5 columns]
Q30: Based on the age categories you created above, what are the normalized percentages for the various age categories using value_counts(). Sort the value counts by index. Save this as Q30.
Q30= Q29['age_category'].value_counts(normalize=True).sort_index()
print(Q30)
18-24 0.190289 25-34 0.437008 35-44 0.211286 45-54 0.124672 55-64 0.031496 65+ 0.005249 Name: age_category, dtype: float64
Q31: Which contestant(s) played a perfect game? A perfect game is considered when the contestant:
Save this DataFrame as Q31 with the following columns: season_name, season, castaway_id, full_name, tribal_council_votes, jury_votes. The DataFrame should be sorted by season and the index should be 0 to n-1.
unanimous_season = [14,18,26,31,33]
perfect_game = combined4[(combined4['total_votes_received'] == 0)]
Q31 = perfect_game[perfect_game['season'].isin(unanimous_season)]
Q31 = Q31.merge(season_summary, how='inner', left_on= 'season', right_on='season')
Q31 = Q31[['season_name_x','season','castaway_id','full_name_x','total_votes_received','final_vote']]
Q31 = Q31.rename(columns={'season_name_x':'season_name','full_name_x':'full_name','total_votes_received':'tribal_council_votes','final_vote':'jury_votes'})
Q31['jury_votes'] = Q31['jury_votes'].str.split('-').str[0]
print(Q31)
season_name season castaway_id full_name \ 0 Survivor: Tocantins 18 281 James Thomas Jr. 1 Survivor: Caramoan 26 348 John Cochran tribal_council_votes jury_votes 0 0 7 1 0 8